Harnessing large language models (LLMs) for candidate gene prioritization and selection

Background
Feature selection is a critical step for translating advances afforded by systems-scale molecular profiling into actionable clinical insights. While data-driven methods are commonly utilized for selecting candidate genes, knowledge-driven methods must contend with the challenge of efficiently sifting through extensive volumes of biomedical information. This work aimed to assess the utility of large language models (LLMs) for knowledge-driven gene prioritization and selection.

Methods
In this proof of concept, we focused on 11 blood transcriptional modules associated with an Erythroid cells signature. We evaluated four leading LLMs across multiple tasks. Next, we established a workflow leveraging LLMs. The steps consisted of: (1) Selecting one of the 11 modules; (2) Identifying functional convergences among constituent genes using the LLMs; (3) Scoring candidate genes across six criteria capturing the gene’s biological and clinical relevance; (4) Prioritizing candidate genes and summarizing justifications; (5) Fact-checking justifications and identifying supporting references; (6) Selecting a top candidate gene based on validated scoring justifications; and (7) Factoring in transcriptome profiling data to finalize the selection of the top candidate gene.

Results
Of the four LLMs evaluated, OpenAI's GPT-4 and Anthropic's Claude demonstrated the best performance and were chosen for the implementation of the candidate gene prioritization and selection workflow. This workflow was run in parallel for each of the 11 erythroid cell modules by participants in a data mining workshop. Module M9.2 served as an illustrative use case. The 30 candidate genes forming this module were assessed, and the top five scoring genes were identified as BCL2L1, ALAS2, SLC4A1, CA1, and FECH. Researchers carefully fact-checked the summarized scoring justifications, after which the LLMs were prompted to select a top candidate based on this information. GPT-4 initially chose BCL2L1, while Claude selected ALAS2. When transcriptional profiling data from three reference datasets were provided for additional context, GPT-4 revised its initial choice to ALAS2, whereas Claude reaffirmed its original selection for this module.

Conclusions
Taken together, our findings highlight the ability of LLMs to prioritize candidate genes with minimal human intervention. This suggests the potential of this technology to boost productivity, especially for tasks that require leveraging extensive biomedical knowledge.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12967-023-04576-8.


Scoring and prioritization
Genes were scored on six criteria using GPT-4 and Claude. The scores were averaged, and candidate genes were ranked according to their cumulative scores (Figure 1, Methods: Step 3 and Step 4).
The two LLMs were requested to score each gene on the following six statements:
a. The gene is associated with erythroid cells or erythropoiesis.
b. The gene is currently being used as a biomarker in clinical settings.
c. The gene has potential value as a blood transcriptional biomarker.
d. The gene is relevant to circulating leukocytes immune biology.
e. The gene is a known drug target.
f. The gene is therapeutically relevant for immune-mediated diseases.
The top five genes selected for further evaluation were: PBX1, ADIPOR1, BLVRB, CARM1 and GLUL (Figure 2).
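The averaging and ranking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the criterion labels, the function name `rank_genes`, the placeholder gene names, and the example scores are all assumptions, and the scoring scale is not specified in this section.

```python
from statistics import mean

# Short labels for the six scoring statements (a-f); the label names are illustrative.
CRITERIA = [
    "erythroid",
    "clinical_biomarker",
    "blood_transcriptional_biomarker",
    "leukocyte_immune_biology",
    "drug_target",
    "immune_mediated_therapy",
]


def rank_genes(scores_by_model):
    """Average each criterion across models, sum the averages, and rank genes.

    scores_by_model maps model name -> gene -> criterion -> numeric score.
    Returns a list of (gene, cumulative_score), highest score first.
    """
    models = list(scores_by_model.values())
    # Only rank genes that every model has scored.
    genes = set.intersection(*(set(m) for m in models))
    totals = {}
    for gene in genes:
        totals[gene] = sum(
            mean(m[gene][criterion] for m in models) for criterion in CRITERIA
        )
    # Sort by descending score; break ties alphabetically for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores for two placeholder genes (not values from this report).
example_scores = {
    "GPT-4": {
        "GENE_A": dict.fromkeys(CRITERIA, 8),
        "GENE_B": dict.fromkeys(CRITERIA, 2),
    },
    "Claude": {
        "GENE_A": dict.fromkeys(CRITERIA, 6),
        "GENE_B": dict.fromkeys(CRITERIA, 4),
    },
}

ranking = rank_genes(example_scores)
top_candidates = [gene for gene, _ in ranking[:5]]
```

Summing per-criterion averages weights all six statements equally, which matches the equal weighting requested later in the selection prompt; a different weighting scheme would only require multiplying each average by a weight before summing.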

Knowledge-driven evaluation of top five candidates
Justifications given by GPT-4 and Claude for the scores provided across the six criteria were compiled and summarized by GPT-4 (Methods: Step 5). For each statement, pertinent references were retrieved using GPT-4 or Claude, with the relevance of each reference checked and attributed manually by the researcher authoring this report.
Function: "The PBX1, ADIPOR1, BLVRB, CARM1, and GLUL genes encode proteins integral to a variety of biological processes. The PBX1 gene codes for a transcription factor that plays a critical role in embryonic development and cell differentiation, significantly contributing to organogenesis through the regulation of gene expression patterns (4,5). ADIPOR1 encodes a receptor protein for adiponectin, a hormone essential for glucose levels regulation and fatty acid breakdown. This receptor primarily expresses in skeletal muscle tissue, suggesting its key role in metabolic regulation (6,7,8,9). The BLVRB gene is central to heme metabolism, converting biliverdin into bilirubin, a potent antioxidant, thereby aiding in cellular protection against oxidative stress (10,11). The protein encoded by the CARM1 gene functions as a histone arginine methyltransferase and a transcriptional coactivator for several nuclear receptors, including the estrogen receptor alpha. This protein contributes significantly to cellular functions such as growth and differentiation (12,13). Finally, the GLUL gene codes for the Glutamate-Ammonia Ligase enzyme, which is instrumental in nitrogen metabolism by catalyzing the formation of glutamine from glutamate and ammonia, thereby regulating glutamine metabolism (14)".
a. Relevance to erythroid cells and erythropoiesis: "Preliminary research indicates potential roles of PBX1, BLVRB, CARM1, ADIPOR1, and GLUL in erythropoiesis, albeit these roles remain largely undefined, highlighting the need for further investigation. PBX1, implicated in hematopoietic development and potentially erythropoiesis, requires additional research to elucidate its precise function in this domain (4,5). Similarly, BLVRB, crucial in heme catabolism, could play a significant role in erythropoiesis given its importance in erythroid cells, yet the specifics of this association are still to be determined (15,59). The link between CARM1 and erythropoiesis is only suggested by limited evidence, and further exploration is necessary to discern the magnitude and characteristics of its involvement (16). Furthermore, while not traditionally associated with erythropoiesis, initial findings suggest potential roles of ADIPOR1 and GLUL in this process. The precise contributions and involvement of these two genes, however, remain uncertain and require comprehensive substantiation through detailed studies. Thus, the exact nature of these genes' roles and relationships with erythroid cells needs to be accurately defined through rigorous validation".
b. Is used as a clinical biomarker: "Although preliminary evidence points to potential roles of PBX1, ADIPOR1, BLVRB, CARM1, and GLUL as prognostic markers in various diseases, their widespread adoption as clinical biomarkers remain unrealized as of 2021. PBX1, associated with certain cancers such as breast cancer and leukemia, requires further validation before it can be broadly applied in oncological clinical settings (17,18). Similarly, ADIPOR1, suggested as a biomarker in metabolic disorders, obesity, and type 2 diabetes, needs thorough validation before routine integration into clinical biomarker panels (19,20,21,22). Despite its hypothetical roles in various biological processes, BLVRB's use as a clinical biomarker is not yet established, highlighting the necessity for further exploration (23). While CARM1 is implicated in numerous cellular processes, its integration as a clinical biomarker is still pending, necessitating additional research (24,25). Lastly, while studies suggest a potential role for GLUL as a biomarker in certain cancers and liver diseases, it is not yet adopted as a standard clinical biomarker, underscoring the need for further comprehensive validation".
c. Potential relevance as a blood transcriptional biomarker: "While PBX1, ADIPOR1, BLVRB, CARM1, and GLUL all have established roles in various biological processes, their utilization as blood transcriptional biomarkers is still not definitively established due to limited evidence. PBX1's recognized role in hematopoietic development hints at its potential as a biomarker, but further research and validation are needed (5,26). Similarly, despite preliminary findings suggesting the utility of ADIPOR1 as a biomarker in metabolic diseases, comprehensive validation is necessary (21). BLVRB's role in heme metabolism could theoretically make it a useful biomarker, but the current evidence supporting this is insufficient, demanding further investigations (27). Although CARM1's function in transcriptional regulation might suggest it as a potential biomarker, definitive proof is still lacking, necessitating further research. Lastly, despite correlations between GLUL expression and certain diseases, existing evidence for its utility as a transcriptional biomarker is limited, indicating the need for further validation and extensive research (28)".
d. Relevance to leukocytes immune biology: "PBX1, known for its significant role in hematopoietic development, is intrinsically tied to leukocyte biology, given the critical involvement of leukocytes in hematopoiesis. This association emphasizes PBX1's potential relevance in understanding immune cell functions (5,29). Although preliminary evidence indicates potential links between ADIPOR1, BLVRB, CARM1, and GLUL and leukocyte biology, these relationships are not definitively established. ADIPOR1's role in regulating inflammatory responses through adiponectin suggests a connection to leukocyte biology, but direct investigations are currently lacking (30,31,32). While not directly associated with leukocytes, BLVRB may influence immune cell function indirectly through its end product, bilirubin, known for its immune-modulating properties (33). Similarly, CARM1, acknowledged for its regulatory role in immune responses (34,35) and GLUL, critical for glutamine synthesis (a crucial nutrient for immune cells), may have implications for leukocyte biology. However, all these potential links require further comprehensive studies and validation for conclusive elucidation".
e. Is a known drug target: "PBX1, despite its implications in tumorigenesis, is not recognized as a primary drug target, as it lacks qualifying characteristics for direct pharmacological interventions (36,37). Similarly, ADIPOR1 (8,38,39) and BLVRB (40,41,42), although investigated for their potential roles in therapeutic interventions for metabolic disorders and various diseases respectively, are not yet considered major drug targets, with no drugs directly targeting these genes available as of 2021. This underscores the need for further research to elucidate their utility in drug development. Contrarily, CARM1, due to its crucial role in cellular growth and differentiation, especially in oncological contexts, is emerging as a potential epigenetic drug target, indicating new directions for therapeutics design (43,44). Finally, despite investigations into GLUL's potential as a therapeutic target for specific cancers and neurodegenerative disorders, its broad recognition as a direct drug target is still pending as of 2021, highlighting the importance of ongoing research in this area (45)".
f. Potential therapeutic relevance for immune-mediated diseases: "PBX1, recognized for its role in hematopoietic development and association with hematologic malignancies, particularly leukemias, exhibits potential therapeutic relevance in immune-mediated diseases. However, this potential necessitates further research for full elucidation (5,46,47). Similarly, ADIPOR1, the receptor for adiponectin involved in anti-inflammatory processes, suggests therapeutic applicability in immune-mediated diseases, yet requires additional evidence for solid validation (8,31,48,49). BLVRB, contributing to the production of antioxidant and anti-inflammatory bilirubin, also implies potential therapeutic relevance for immune-mediated diseases, but the current evidence is preliminary and inconclusive. Consequently, additional investigation is necessary to establish this connection robustly (10,50,51). CARM1, known for its involvement in immune responses, shows potential therapeutic relevance for immune-mediated diseases, although existing research is not conclusive, warranting further investigation (52,53). Finally, the GLUL gene, responsible for glutamine synthesis crucial to immune cell function, suggests potential therapeutic applicability for immune-mediated diseases. Despite some initial evidence, the association remains unconfirmed, highlighting the need for further comprehensive studies for validation and exploration of its therapeutic potential (54)".

Examining expression patterns of top 5 candidates across leukocyte populations
The expression patterns of the top 5 candidate genes were examined across diverse leukocyte populations and hematopoietic precursors using two reference transcriptomic datasets.

BloodGen3 reference cohorts (GSE100150)
Figure 5: The box plot in panel A shows the changes in abundance of transcripts comprising the M13.30 module, expressed within a cohort as the percentage of genes showing differential expression between each patient and the respective set of control subjects for this cohort. Differential expression is based on a fixed cut-off (Log2 Fold Change >1.5, difference >10). These values are expressed as % module activity and can range from 100% (all genes are upregulated) to -100% (all genes are downregulated). The box plot in panel B shows expression levels for the top 5 scoring M13.30 genes in patient cohorts relative to respective control subjects matched for demographic variables.
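The % module activity value described in this caption can be illustrated with a short Python sketch. Several details are assumptions rather than statements from the source: each patient is compared with the mean of the cohort's control subjects, the same cut-offs are applied symmetrically for downregulation, expression values are on a positive linear scale, and the value reported is the net percentage (upregulated minus downregulated genes). The function name `percent_module_activity` and the example values are hypothetical.

```python
import math


def percent_module_activity(patient, control_mean, log2fc_cutoff=1.5, diff_cutoff=10.0):
    """Net percentage of module genes called up- or downregulated in one patient.

    patient and control_mean map gene -> expression value (linear scale, > 0).
    Returns a value between -100.0 (all genes down) and 100.0 (all genes up).
    """
    genes = sorted(patient.keys() & control_mean.keys())
    if not genes:
        raise ValueError("no shared genes between patient and control profiles")
    up = down = 0
    for gene in genes:
        p, c = patient[gene], control_mean[gene]
        if p <= 0 or c <= 0:
            raise ValueError(f"non-positive expression for {gene}")
        log2fc = math.log2(p / c)
        diff = p - c
        # A gene counts only if it passes both the fold-change and difference cut-offs.
        if log2fc > log2fc_cutoff and diff > diff_cutoff:
            up += 1
        elif log2fc < -log2fc_cutoff and diff < -diff_cutoff:
            down += 1
    return 100.0 * (up - down) / len(genes)


# Hypothetical four-gene module: two genes up, one unchanged, one down.
controls = {"G1": 100.0, "G2": 100.0, "G3": 100.0, "G4": 100.0}
patient = {"G1": 400.0, "G2": 350.0, "G3": 105.0, "G4": 20.0}
activity = percent_module_activity(patient, controls)
```

Requiring both cut-offs keeps low-abundance transcripts with large ratios but negligible absolute change from being counted as differentially expressed.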

Overall pick(s) for M13.30
We aim to establish a universal immune profiling assay that measures transcript abundance of a panel of genes in whole blood using small- or meso-scale molecular profiling technology. Compelling candidates identified here from the M13.30 selection pool will be subjected to more in-depth evaluation using a pre-established workflow (57,58).
Claude and GPT-4 were prompted to select a top candidate on the basis of the information summarized above in section 4, taking into account all six scoring criteria equally (Methods: Step 7).

Claude chose BLVRB:
Here is a summary of the key conclusions regarding the selection of a candidate gene for inclusion in a targeted blood transcriptional profiling panel:
Initial Summary:
• Based on preliminary evidence, BLVRB was recommended due to its role in heme metabolism and potential relevance to erythropoiesis.
Leukocyte RNA-seq Data:
• RNA-seq data showing highest BLVRB expression in whole blood and monocytes supported its erythroid associations and biomarker potential.
Leukocyte Microarray Data:
• Microarray data revealed high BLVRB expression specifically in erythroid precursor cells, confirming its erythroid specificity.
Patient Cohort RNA Abundance Data:
• Moderate decreases in BLVRB RNA levels across immune-related disease states provided further evidence for its usefulness as a broad biomarker reflecting immune activity.
In summary, BLVRB was consistently recommended throughout the process based on multiple lines of evidence indicating its high expression in erythroid lineages and involvement in heme metabolism, along with detectable expression changes in immune cell samples and disease cohorts. The data sources all strengthen the case for BLVRB as a top candidate gene meeting both the erythroid and immune biology goals for the targeted blood profiling panel.

GPT-4 chose PBX1:
Certainly! Here are the key conclusions we can draw from the data and conversation so far:
1. Gene Overview:
• PBX1 is associated with the formation and function of blood cells.
• ADIPOR1 is linked with metabolism and has a role in immune function.
• BLVRB is involved in cell proliferation, differentiation, and apoptosis, with specific importance in early erythroid differentiation.
• CARM1 is linked with transcriptional regulation, cell growth, and apoptosis.
• GLUL plays a role in nitrogen metabolism and is also crucial for the brain's neurotransmission.
2. RNA Abundance in Different Conditions:
• For PBX1: Elevated RNA abundance in conditions like B-cell deficiency, metastatic melanoma, active pulmonary tuberculosis, acute respiratory syncytial virus infection, and sepsis. Decreased RNA abundance is seen in conditions such as chronic obstructive pulmonary disease, acute influenza, HIV infection, juvenile dermatomyositis, and multiple sclerosis.
• For ADIPOR1: Elevated RNA abundance in metastatic melanoma and acute respiratory syncytial virus infection. A significant decrease is observed in HIV infection and Kawasaki disease.
• For BLVRB: Elevated RNA abundance in conditions like B-cell deficiency and metastatic melanoma. Significant decrease is seen in acute influenza, HIV infection, and Kawasaki disease.
• For CARM1: Elevated RNA abundance in conditions like juvenile dermatomyositis and metastatic melanoma. A decrease is seen in conditions such as HIV infection, Kawasaki disease, and sepsis.
• For GLUL: Elevated RNA abundance in conditions like juvenile dermatomyositis. A significant decrease is observed in HIV infection.
3. General Observations:
• Most genes show variable RNA abundance patterns across different conditions, implying that their expression is modulated in response to various health states.
• Diseases like HIV infection often result in decreased RNA abundance across multiple genes, suggesting a broad impact on transcriptional activity.
• Some genes like CARM1 and ADIPOR1 have increased expression in specific conditions like metastatic melanoma, hinting at possible roles in disease pathology or as potential biomarkers.
4. Recommendation for Top Candidate: Based on the summarized information, if the goal is to select a gene that is most responsive across a wide range of conditions (both upregulated and downregulated), PBX1 stands out. It shows a broad range of responses, being affected in conditions like B-cell deficiency, metastatic melanoma, HIV infection, and more. This variability suggests that PBX1 might be involved in a range of physiological and pathological processes, making it a valuable inclusion in a targeted blood transcriptional profiling panel. However, the final decision should also consider the specific goals of the panel and the diseases or conditions it aims to target.
This summary provides an overview of the data presented and its potential implications. Further research and validations would be essential to establish any clinical significance.
Given its comparatively higher expression levels and restriction across leukocyte populations, the top candidate for this module is PBX1. Considering this, in conjunction with the rationale provided by both LLMs and our understanding of the biological and clinical relevance of BLVRB and PBX1, we identify PBX1 as our primary M13.30 selection for subsequent characterization.
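The breadth-of-response argument in GPT-4's recommendation can be made concrete by tallying the conditions it listed for each gene. The sketch below transcribes only the conditions named in the GPT-4 summary above; because that summary introduces them with "conditions like", the lists may be incomplete, and the variable names and the tie-breaking rule are illustrative assumptions.

```python
# Conditions transcribed from the GPT-4 summary above ("up" = elevated RNA
# abundance, "down" = decreased). The lists may not be exhaustive.
RESPONSES = {
    "PBX1": {
        "up": {"B-cell deficiency", "metastatic melanoma",
               "active pulmonary tuberculosis",
               "acute respiratory syncytial virus infection", "sepsis"},
        "down": {"chronic obstructive pulmonary disease", "acute influenza",
                 "HIV infection", "juvenile dermatomyositis",
                 "multiple sclerosis"},
    },
    "ADIPOR1": {
        "up": {"metastatic melanoma",
               "acute respiratory syncytial virus infection"},
        "down": {"HIV infection", "Kawasaki disease"},
    },
    "BLVRB": {
        "up": {"B-cell deficiency", "metastatic melanoma"},
        "down": {"acute influenza", "HIV infection", "Kawasaki disease"},
    },
    "CARM1": {
        "up": {"juvenile dermatomyositis", "metastatic melanoma"},
        "down": {"HIV infection", "Kawasaki disease", "sepsis"},
    },
    "GLUL": {
        "up": {"juvenile dermatomyositis"},
        "down": {"HIV infection"},
    },
}

# Breadth of response: number of listed conditions with a change in either direction.
breadth = {gene: len(r["up"]) + len(r["down"]) for gene, r in RESPONSES.items()}

# Rank by descending breadth; ties are broken alphabetically (an arbitrary choice).
breadth_ranking = sorted(breadth.items(), key=lambda item: (-item[1], item[0]))

# Conditions in which every one of the five genes is listed as decreased.
shared_decreases = set.intersection(*(r["down"] for r in RESPONSES.values()))
```

On these transcribed lists PBX1 has the widest response (ten conditions), and HIV infection is the only condition listed as decreased for all five genes, which is consistent with the general observations in the GPT-4 summary.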
Prompts
PROMPT 1: Based on the summary below, please recommend the top candidate gene to include in a targeted blood transcriptional profiling panel. The intent is for the selected gene to be representative of an erythroid cell signature associated with response to mRNA vaccines, severity of RSV infection and that found to be expressed at high levels in patients with metastatic melanoma and in liver transplant recipients. Please weigh the following criteria equally in your recommendation: 1) Relevance to erythroid cells and erythropoiesis, 2) Current use as a biomarker, 3) Potential use as a blood transcriptional biomarker, 4) Relevance to leukocyte immune biology, 5) Status as a known drug target, 6) Therapeutic relevance for immune mediated diseases: [input summarized and fact-checked justifications from Step 5]
CLAUDE: Based on the provided summary and criteria, I would recommend BLVRB as the top candidate gene to include in the targeted blood transcriptional profiling panel.
The key reasons are:
• BLVRB plays a significant role in heme metabolism, which is intrinsically important for erythropoiesis and erythroid cell biology. This aligns closely with the criterion for erythroid relevance.
• BLVRB may have some relevance to immune cell biology through the immune-modulating effects of its product bilirubin.
• Though not an established biomarker yet, BLVRB's involvement in heme metabolism suggests potential utility as a blood transcriptional biomarker.
GPT-4: Given the detailed summary provided on the five candidate genes (PBX1, ADIPOR1, BLVRB, CARM1, and GLUL) and the six criteria for evaluation, we can rank the genes based on their relevance and potential as a targeted blood transcriptional profiling panel marker.
1. Relevance to erythroid cells and erythropoiesis: From the earlier information, all genes have potential relevance, but PBX1 and BLVRB seemed slightly more directly associated. However, this wasn't conclusive.
2. Current use as a biomarker: As of 2021, none of the genes were extensively adopted as clinical biomarkers, although there was some preliminary evidence pointing towards their potential roles.
3. Potential use as a blood transcriptional biomarker: Given the RNA-seq data, GLUL has a strong signal, especially in neutrophils, which indicates its potential as a blood transcriptional biomarker.
4. Relevance to leukocyte immune biology: PBX1 has intrinsic ties to leukocyte biology, but the high expression of GLUL, particularly in neutrophils, makes it relevant too.
5. Status as a known drug target: CARM1 was noted to be emerging as an epigenetic drug target. GLUL had potential for specific cancers and neurodegenerative disorders.
6. Therapeutic relevance for immune-mediated diseases: Both GLUL and CARM1 had potential therapeutic applicability for immune-mediated diseases.
Given the combination of the RNAseq data and the initial summary, while PBX1 and BLVRB had some promising connections to erythroid cells, GLUL stands out due to its:
1. High expression levels, especially in neutrophils, suggesting its potential use as a blood transcriptional biomarker.
This summary provides an overview of the data presented and its potential implications. Further research and validations would be essential to establish any clinical significance.

Figure 1: The stacked bar graph shows cumulative scores across six criteria for the 29 M13.30 genes.

Figure 3: This stacked bar graph shows the expression levels of five candidate genes. ERY3-5 populations are CD71+ GLYA+ erythroid cells. The original article provides full experimental details (55). The web link above also provides access to more complete information about this dataset as well as individual gene profiles.

Figure 4: This stacked bar graph shows the expression levels measured by RNAseq of the five candidate genes across six leukocyte populations and whole blood. The abundance shown for each gene in each population is an average of up to 20 samples. Additional experimental details are accessible via the link provided above and are available as part of the GEO record for this sample series (GSE60424) (56).